edge query
Crowdsourced Clustering: Querying Edges vs Triangles
We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.
Crowdsourced Clustering: Querying Edges vs Triangles
We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > California (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Communications > Social Media > Crowdsourcing (0.84)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Crowdsourced Clustering: Querying Edges vs Triangles Babak Hassibi Department of Electrical Engineering Department of Electrical Engineering Caltech, Pasadena
We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > California (0.04)
- (2 more...)
- Information Technology > Communications > Social Media > Crowdsourcing (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Graph Embeddings via Tensor Products and Approximately Orthonormal Codes
We propose a dynamic graph representation method, showcasing its rich representational capacity and establishing some of its theoretical properties. Our representation falls under the bind-and-sum approach in hyperdimensional computing (HDC), and we show that the tensor product is the most general binding operation that respects the superposition principle employed in HDC. We also establish some precise results characterizing the behavior of our method, including a memory vs. size analysis of how our representation's size must scale with the number of edges in order to retain accurate graph operations. True to its HDC roots, we also compare our graph representation to another typical HDC representation, the Hadamard-Rademacher scheme, showing that these two graph representations have the same memory-capacity scaling. We establish a link to adjacency matrices, showing that our method is a pseudo-orthogonal generalization of adjacency matrices. In light of this, we briefly discuss its applications toward a dynamic compressed representation of large sparse graphs.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- (3 more...)
Random Subgraph Detection Using Queries
Huleihel, Wasim, Mazumdar, Arya, Pal, Soumyabrata
The planted densest subgraph detection problem refers to the task of testing whether in a given (random) graph there is a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $n$ nodes. Under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi graph with edge probability (or, density) $q$. Under the alternative, there is a subgraph on $k$ vertices with edge probability $p>q$. The statistical as well as the computational barriers of this problem are well-understood for a wide range of the edge parameters $p$ and $q$. In this paper, we consider a natural variant of the above problem, where one can only observe a small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and sufficient for detecting the presence of the planted subgraph. Specifically, we show that any (possibly randomized) algorithm must make $\mathsf{Q} = \Omega(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ adaptive queries (on expectation) to the adjacency matrix of the graph to detect the planted subgraph with probability more than $1/2$, where $\chi^2(p||q)$ is the Chi-Square distance. On the other hand, we devise a quasi-polynomial-time algorithm that detects the planted subgraph with high probability by making $\mathsf{Q} = O(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ non-adaptive queries. We then propose a polynomial-time algorithm which is able to detect the planted subgraph using $\mathsf{Q} = O(\frac{n^3}{k^3\chi^2(p||q)}\log^3 n)$ queries. We conjecture that in the leftover regime, where $\frac{n^2}{k^2}\ll\mathsf{Q}\ll \frac{n^3}{k^3}$, no polynomial-time algorithms exist. Our results resolve two questions posed in \cite{racz2020finding}, where the special case of adaptive detection and recovery of a planted clique was considered.
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
Planted Bipartite Graph Detection
Rotenberg, Asaf, Huleihel, Wasim, Shayevitz, Ofer
We consider the task of detecting a hidden bipartite subgraph in a given random graph. Specifically, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We derive asymptotically tight upper and lower bounds for this detection problem in both the dense regime, where $q,p = \Theta\left(1\right)$, and the sparse regime where $q,p = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. Moreover, we consider a variant of the above problem, where one can only observe a relatively small part of the graph, by using at most $\mathsf{Q}$ edge queries. For this problem, we derive upper and lower bounds in both the dense and sparse regimes.
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
A Framework for Searching in Graphs in the Presence of Errors
Dereniowski, Dariusz, Tiegel, Stefan, Uznański, Przemysław, Wolleb-Graf, Daniel
We consider the problem of searching for an unknown target vertex $t$ in a (possibly edge-weighted) graph. Each \emph{vertex-query} points to a vertex $v$ and the response either admits $v$ is the target or provides any neighbor $s\not=v$ that lies on a shortest path from $v$ to $t$. This model has been introduced for trees by Onak and Parys [FOCS 2006] and for general graphs by Emamjomeh-Zadeh et al. [STOC 2016]. In the latter, the authors provide algorithms for the error-less case and for the independent noise model (where each query independently receives an erroneous answer with known probability $p<1/2$ and a correct one with probability $1-p$). We study this problem in both adversarial errors and independent noise models. First, we show an algorithm that needs $\frac{\log_2 n}{1 - H(r)}$ queries against \emph{adversarial} errors, where adversary is bounded with its rate of errors by a known constant $r<1/2$. Our algorithm is in fact a simplification of previous work, and our refinement lies in invoking amortization argument. We then show that our algorithm coupled with Chernoff bound argument leads to an algorithm for independent noise that is simpler and with a query complexity that is both simpler and asymptotically better to one of Emamjomeh-Zadeh et al. [STOC 2016]. Our approach has a wide range of applications. First, it improves and simplifies Robust Interactive Learning framework proposed by Emamjomeh-Zadeh et al. [NIPS 2017]. Secondly, performing analogous analysis for \emph{edge-queries} (where query to edge $e$ returns its endpoint that is closer to target) we actually recover (as a special case) noisy binary search algorithm that is asymptotically optimal, matching the complexity of Feige et al. [SIAM J. Comput. 1994]. Thirdly, we improve and simplify upon existing algorithm for searching of \emph{unbounded} domains due to Aslam and Dhagat [STOC 1991].
- Europe > Poland (0.28)
- North America > United States (0.28)
- Europe > Switzerland > Zürich > Zürich (0.14)
Crowdsourced Clustering: Querying Edges vs Triangles
Vinayak, Ramya Korlakai, Hassibi, Babak
We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost.
Crowdsourced Clustering: Querying Edges vs Triangles
Vinayak, Ramya Korlakai, Hassibi, Babak
We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > California (0.04)
- (2 more...)
- Information Technology > Communications > Social Media > Crowdsourcing (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)